We start by reading the data from the wineQualityReds CSV file and storing it in a variable, wineQualityData.
##        X          fixed.acidity    volatile.acidity   citric.acid
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000
##  residual.sugar      chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00
##  Median : 2.200   Median :0.07900   Median :14.00
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00
##  total.sulfur.dioxide    density             pH           sulphates
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000
##     alcohol          quality
##  Min.   : 8.40   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.20   Median :6.000
##  Mean   :10.42   Mean   :5.636
##  3rd Qu.:11.10   3rd Qu.:6.000
##  Max.   :14.90   Max.   :8.000
Data: We have 1599 rows of data, where X is the unique identifier for each wine. There are 11 measured properties that are used to explain the quality of a wine. Quality is an ordered variable whose values range from 3 to 8 for our given set of wines. The mean wine quality is 5.6.
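As a rough illustration of how a six-number summary like the one above can be computed, here is a minimal sketch in Python using only the standard library. The function name `summarize` and the tiny inline sample are assumptions made for this example; the sample rows are made up and are not taken from the dataset.

```python
import csv
import io
import statistics


def summarize(csv_file, column):
    """Return min, quartiles, mean and max for one numeric CSV column."""
    values = [float(row[column]) for row in csv.DictReader(csv_file)]
    # method="inclusive" interpolates between the observed minimum and
    # maximum, treating the values as the full population of interest.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(values),
        "q3": q3,
        "max": max(values),
    }


# Made-up sample with two of the column names seen in the summary output.
sample = io.StringIO("quality,alcohol\n5,9.4\n5,9.8\n6,10.5\n7,11.2\n6,10.0\n")
quality_summary = summarize(sample, "quality")
print(quality_summary)
```

To run it on the real data, the same function could be given an open file handle for the wineQualityReds CSV file instead of the inline sample, assuming the header row uses the column names shown in the summary.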
We start by exploring each variable on its own, looking at its distribution before relating it to the quality of a wine.
The majority of the wines have a quality of 5 or 6, with very few wines being really good or bad (8 or 3 respectively).
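The counts behind a quality histogram can be tabulated with a simple frequency count. The sketch below is in Python; `quality_counts` is a hypothetical helper and the scores are invented for illustration, so the resulting counts say nothing about the real dataset.

```python
from collections import Counter


def quality_counts(qualities):
    """Count how many wines received each quality score, ordered by score."""
    return dict(sorted(Counter(qualities).items()))


# Made-up scores, for illustration only.
counts = quality_counts([5, 6, 5, 7, 6, 5, 3, 8, 6, 5])
print(counts)
```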
We analyze each property individually.
From the above plot, it appears that the majority of the values for fixed acidity lie in the range 5 to 14, so we limit our fixed acidity values to this range.
The median for fixed acidity is around 8 and the distribution is positively skewed. A large number of values lie in the range 7 to 9.
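Restricting a variable to a range before re-examining it can be sketched as follows in Python. The helper `within` is an assumption for this example and the values are made up; only the limits 5 and 14 come from the text.

```python
import statistics


def within(values, low, high):
    """Keep only the values that fall inside the closed range [low, high]."""
    return [v for v in values if low <= v <= high]


# Made-up fixed acidity readings, for illustration only.
fixed_acidity = [4.6, 7.1, 7.4, 7.9, 8.3, 9.2, 11.5, 15.9]
kept = within(fixed_acidity, 5, 14)
median_kept = statistics.median(kept)
print(kept, median_kept)
```

Note that trimming changes the sample, so any median or skew reported afterwards describes the restricted range rather than the full data.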
The majority of the values for volatile acidity lie in the range 0.2 to 1.
The median is around 0.54 and this distribution is also positively skewed.
A lot of citric acid values appear to be zero. The data available for citric acid might be incomplete.
The distribution of residual sugar is heavily skewed, with the bulk of the values on the left and a long tail to the right. Most of the data lies in the range 1 to 5.
Even after filtering some outliers, the data is still positively skewed with a median around 2.25.
The data for chlorides is similar to that of residual sugar. We consider the data that lies between 0.04 and 0.14.
The data for this range appears to be normally distributed with a few outliers. The median is around 0.08.
Most of the values for free sulfur dioxide lie in the range of 0 to 35.
For this property we see a high peak around 7-8, which gives our graph a positive skew. The median, however, is around 13. This is because of the long tail of values in the high range.
Most of the values are in the range 0 to 100. Since free sulfur dioxide is a subset of total sulfur dioxide, we can expect to see a similar positively skewed graph for total sulfur dioxide.
Our expectation was correct in this case: we see a positively skewed graph with a high peak around 25, whereas the median is around 36. We can say that the values for total sulfur dioxide are somewhat proportional to those for free sulfur dioxide.
The data for density is normally distributed.
Both the median and the mean appear to be around 0.997. A mean that matches the median indicates a symmetric distribution, which is consistent with the roughly normal shape of the plot.
The data for pH level is also normally distributed.
Both the median and the mean appear to be around 3.3. As with density, the close agreement between the two is consistent with a symmetric, approximately normal distribution.
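The mean-versus-median comparison can be made a little more concrete by expressing the gap in units of the interquartile range. The Python sketch below uses the figures from the summary table at the top; the helper `median_offset` is an assumption for this example, and a small offset only hints at symmetry rather than proving normality.

```python
def median_offset(mean, median, q1, q3):
    """Gap between mean and median, in units of the interquartile range."""
    return (mean - median) / (q3 - q1)


# (mean, median, 1st quartile, 3rd quartile) copied from the summary table.
summary = {
    "density": (0.9967, 0.9968, 0.9956, 0.9978),
    "pH": (3.311, 3.310, 3.210, 3.400),
    "residual.sugar": (2.539, 2.200, 1.900, 2.600),
}

offsets = {name: median_offset(*stats) for name, stats in summary.items()}
for name, offset in offsets.items():
    print(f"{name}: {offset:+.3f}")
```

Density and pH come out close to zero, while residual sugar shows a clearly positive offset, in line with the positive skew described earlier.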
For sulphates, we put our limits at 0.3 and 1.
Most alcohol percentages are around 9 to 11%, which is normal, with a few values going up to 13.
This graph is positively skewed with a median around 10.2, which is normal because most of the wines have their alcohol percentage in the 9% to 11% range.
In the univariate analysis we observed that many values for citric acid are zero, which indicates that the data might be incomplete. Many properties, such as fixed acidity, volatile acidity and alcohol content, tend to have positive skews, which might be useful in the later part of our analysis. Another important point to note here is that total sulfur dioxide and free sulfur dioxide are somewhat correlated with each other.
From the above plots we can see that wines with higher quality have a lower median density. We can see a negative correlation between the quality and the density of a wine.
Higher quality wines in the dataset have higher alcohol content on average as compared to the lower quality ones. There is a positive correlation between alcohol and quality.
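The per-quality comparison that a boxplot shows can also be checked numerically by taking the median of a property within each quality level. This is a minimal Python sketch; `median_by_quality` is a hypothetical helper and the rows are invented, so the numbers illustrate the mechanics only.

```python
import statistics
from collections import defaultdict


def median_by_quality(rows, column):
    """Median of `column` for each quality score, ordered by score."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["quality"]].append(row[column])
    return {q: statistics.median(vals) for q, vals in sorted(groups.items())}


# Made-up rows, for illustration only.
rows = [
    {"quality": 5, "alcohol": 9.4},
    {"quality": 5, "alcohol": 9.8},
    {"quality": 5, "alcohol": 10.0},
    {"quality": 6, "alcohol": 10.5},
    {"quality": 6, "alcohol": 10.9},
    {"quality": 7, "alcohol": 11.6},
    {"quality": 7, "alcohol": 12.0},
    {"quality": 7, "alcohol": 11.2},
]
alcohol_medians = median_by_quality(rows, "alcohol")
print(alcohol_medians)
```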
Wines are generally acidic in nature, which explains why almost all pH levels are well below 7 (which is neutral). We can observe that most wines have a pH level in the range 3 to 4, and there is a slight negative correlation with quality.
There are many outliers for the residual sugar property. Let’s filter out the outliers and plot the values.
The residual sugar content is almost the same for all qualities of wine.
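The text does not say how the outliers were filtered. One common convention, shown here purely as an assumption, is Tukey's rule: drop values more than 1.5 interquartile ranges outside the quartiles. The Python sketch below takes the residual sugar quartiles from the summary table; the helper `tukey_fences` and the sample values are made up for this example.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper cut-offs at k interquartile ranges from the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles of residual.sugar from the summary table (1.9 and 2.6).
low, high = tukey_fences(1.9, 2.6)

# Made-up readings, for illustration only.
residual_sugar = [0.9, 1.9, 2.2, 2.6, 3.5, 8.1, 15.5]
kept = [v for v in residual_sugar if low <= v <= high]
print(low, high, kept)
```

Under this rule the cut-offs land near 0.85 and 3.65, which is narrower than the 1 to 5 range mentioned in the univariate section, so the original filter was probably chosen differently.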
It looks like sulphates also has a lot of outliers; however, we can observe a positive correlation from the boxplot. Let’s have a closer look.
Yes, our observation was correct: better quality wines have a higher sulphates content.
##        fixed.acidity     volatile.acidity          citric.acid
##           0.12405165          -0.39055778           0.22637251
## log10.residual.sugar      log10.chlordies  free.sulfur.dioxide
##           0.02353331          -0.17613996          -0.05065606
## total.sulfur.dioxide              density                   pH
##          -0.18510029          -0.17491923          -0.05773139
##      log10.sulphates              alcohol
##           0.30864193           0.47616632
From the above values we can say that alcohol, volatile acidity and sulphates have the highest correlations with quality. We already observed that alcohol and sulphates have a positive correlation with quality. Let’s have a look at volatile acidity vs quality.
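Coefficients like the ones listed above are Pearson correlations between each property and quality. The `log10.` prefixes in the output suggest that some properties were log-transformed first, although that is an inference from the names. The Python sketch below shows the calculation on made-up numbers; `pearson` is a hypothetical helper written out by hand so that the example needs nothing beyond the standard library.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values, for illustration only.
quality = [5, 5, 6, 6, 7, 7]
alcohol = [9.4, 9.8, 10.5, 10.2, 11.6, 12.0]
sulphates = [0.50, 0.56, 0.62, 0.65, 0.74, 0.82]

r_alcohol = pearson(alcohol, quality)
# Log-transform before correlating, as the "log10." column names hint at.
r_log_sulphates = pearson([math.log10(s) for s in sulphates], quality)
print(round(r_alcohol, 3), round(r_log_sulphates, 3))
```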
Volatile acidity has the strongest negative correlation with wine quality (about -0.39).
In the previous section we observed what properties have direct effect on the quality of wines. Let’s have a look at how combinations of these factors affect the quality.
The above graph shows that wines with higher alcohol content and lower volatile acidity tend to have higher quality rating.
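A numeric counterpart to this kind of two-variable plot is to split the wines into groups by both properties and compare the mean quality per group. The Python sketch below splits at the overall medians from the summary table (10.2 for alcohol, 0.52 for volatile acidity); the helper name and the wines themselves are made up, so the output demonstrates the method and not the finding.

```python
from collections import defaultdict

ALCOHOL_MEDIAN = 10.2    # median alcohol from the summary table
VOLATILE_MEDIAN = 0.52   # median volatile.acidity from the summary table


def mean_quality_by_group(wines):
    """Mean quality per (alcohol level, volatile acidity level) group."""
    groups = defaultdict(list)
    for alcohol, volatile_acidity, quality in wines:
        key = (
            "high alcohol" if alcohol > ALCOHOL_MEDIAN else "low alcohol",
            "high volatile acidity"
            if volatile_acidity > VOLATILE_MEDIAN
            else "low volatile acidity",
        )
        groups[key].append(quality)
    return {key: sum(q) / len(q) for key, q in groups.items()}


# Made-up (alcohol, volatile acidity, quality) triples, for illustration only.
wines = [
    (9.4, 0.70, 5), (9.8, 0.88, 5), (9.5, 0.40, 6), (11.2, 0.66, 6),
    (11.6, 0.36, 7), (12.0, 0.30, 7), (10.5, 0.45, 6),
]
group_means = mean_quality_by_group(wines)
for key, value in sorted(group_means.items()):
    print(key, round(value, 2))
```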
Good quality wines tend to have lower sulphates levels. Based on the past two observations, we can expect good quality wines to be prevalent in the bottom left of a graph of sulphates against volatile acidity. Let’s have a look.
This graph stays true to our expectation. A lot of good quality wines lie in the bottom left of the graph.
This graph shows a clear negative relationship between wine quality and volatile acidity: the better the wine quality, the lower the volatile acidity.
We observed that alcohol content has the strongest positive correlation with quality. The following graph depicts that.
The above plots help us understand that volatile acidity and alcohol are the major properties that affect the quality of a wine. There are other factors, like density, pH level and sulphates, that also affect wine quality to some extent.
We were able to identify some properties that might affect the quality of a wine. However, our dataset has only 1599 wines, all produced in a certain region of Portugal, which is a small sample compared with the large number of wines available in the market. Therefore our analysis does not necessarily apply to wines made in other countries. We also need to keep in mind that the ratings were given by a fixed group of individuals; since taste differs from person to person, these ratings do not necessarily reflect the preferences of the entire populace.
While analyzing this dataset, I first started off by analyzing all the properties against quality. This helped me find out whether there are any gaps in the dataset. Citric acid is one such property, with a lot of zero values, which leads me to believe that the data might be incomplete. The univariate analysis also helped me find out which properties have positively skewed distributions and where most of their values lie.
While analyzing this dataset, I came across a lot of outlier values. This required me to drill down into each property and look for insights. At each stage I narrowed the analysis down to some key properties that had an effect on the quality of wine. I think this was a good way to approach the problem because it helped me figure out which properties to focus on.